Research Problem

Last year, I read a paper titled, "Feature Selection Methods for Identifying Genetic Determinants of Host Species in RNA Viruses". This year, I read another paper titled, "Predicting host tropism of influenza A virus proteins using random forest". The essence of these papers were to predict influenza virus host tropism from sequence features. The particular feature engineering steps were somewhat distinct, in which the former used amino acid sequences encoded as binary 1/0s, while the latter used physiochemical characteristics of the amino acid sequences instead. However, the core problem was essentially identical - predict a host classification from influenza protein sequence features. Random forest classifiers were used in both papers, and is a powerful method for identifying non-linear mappings from features to class labels. My question here was to see if I could get comparable performance using a simple neural network.

Data

I downloaded influenza HA sequences from the Influenza Research Database. Sequences dated from 1980 to 2015. Lab strains were excluded, duplicates allowed (captures host tropism of certain sequences). All viral subtypes were included.

Below, let's take a deep dive into what it takes to construct an artificial neural network!

The imports necessary for running this notebook.



In [2]:

    
! echo $PATH
! echo $CUDA_ROOT









    



/home/ericmjl/anaconda3/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local/cuda/bin:/home/ericmjl/.local/bin:/home/ericmjl/bin



In [3]:

    
import pandas as pd
import numpy as np
from Bio import SeqIO
from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment
from collections import Counter
from sklearn.preprocessing import LabelBinarizer
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.metrics import mutual_info_score as mi

from lasagne import layers
from lasagne.updates import nesterov_momentum
from nolearn.lasagne import NeuralNet
import theano









    



Using gpu device 0: Quadro 2000

Read in the viral sequences.



In [4]:

    
sequences = SeqIO.to_dict(SeqIO.parse('20150902_nnet_ha.fasta', 'fasta'))
# sequences

The sequences are going to be of variable length. To avoid the problem of doing multiple sequence alignments, filter to just the most common length (i.e. 566 amino acids).



In [5]:

    
lengths = Counter()
for accession, seqrecord in sequences.items():
    lengths[len(seqrecord.seq)] += 1
    
lengths.most_common(1)[0][0]









    Out[5]:





566

There are sequences that are ambiguously labeled. For example, "Environment" and "Avian" samples. We would like to give a more detailed prediction as to which hosts it likely came from. Therefore, take out the "Environment" and "Avian" samples.



In [6]:

    
# For convenience, we will only work with amino acid sequencees of length 566.
final_sequences = dict()
for accession, seqrecord in sequences.items():
    host = seqrecord.id.split('|')[1]
    if len(seqrecord.seq) == lengths.most_common(1)[0][0]:
        final_sequences[accession] = seqrecord

Create a numpy array to store the alignment.



In [7]:

    
alignment = MultipleSeqAlignment(final_sequences.values())
alignment_array = np.array([list(rec) for rec in alignment])

The first piece of meat in the code begins here. In the cell below, we convert the sequence matrix into a series of binary 1s and 0s, to encode the features as numbers. This is important - AFAIK, almost all machine learning algorithms require numerical inputs.



In [8]:

    
# Create an empty dataframe.
# df = pd.DataFrame()

# # Create a dictionary of position + label binarizer objects.
# pos_lb = dict()

# for pos in range(lengths.most_common(1)[0][0]):
#     # Convert position 0 by binarization.
#     lb = LabelBinarizer()
#     # Fit to the alignment at that position.
#     lb.fit(alignment_array[:,pos])
#     # Add the label binarizer to the dictionary.
#     pos_lb[pos] = lb
#     # Create a dataframe.
#     pos = pd.DataFrame(lb.transform(alignment_array[:,pos]))

#     # Append the columns to the dataframe.
#     for col in pos.columns:
#         maxcol = len(df.columns)
#         df[maxcol + 1] = pos[col]
from isoelectric_point import isoelectric_points
df = pd.DataFrame(alignment_array).replace(isoelectric_points)



In [9]:

    
# Add in host data
df['host'] = [s.id.split('|')[1] for s in final_sequences.values()]
df = df.replace({'X':np.nan, 'J':np.nan, 'B':np.nan, 'Z':np.nan})
df.dropna(inplace=True)
df.to_csv('isoelectric_point_data.csv')



In [10]:

    
# Normalize data to between 0 and 1.
from sklearn.preprocessing import StandardScaler

df_std = pd.DataFrame(StandardScaler().fit_transform(df.ix[:,:-1]))
df_std['host'] = df['host']



In [11]:

    
ambiguous_hosts = ['Environment', 'Avian', 'Unknown', 'NA', 'Bird', 'Sea_Mammal', 'Aquatic_Bird']

unknowns = df_std[df_std['host'].isin(ambiguous_hosts)]

train_test_df = df_std[df_std['host'].isin(ambiguous_hosts) == False]
train_test_df.dropna(inplace=True)









    



/home/ericmjl/anaconda3/lib/python3.4/site-packages/ipykernel/__main__.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

With the cell above, we now have a sequence feature matrix, in which the 566 amino acids positions have been expanded to 6750 columns of binary sequence features.

The next step is to grab out the host species labels, and encode them as 1s and 0s as well.



In [12]:

    
set([i for i in train_test_df['host'].values])









    Out[12]:





{'American_Black_Duck',
 'American_Green_Winged_Teal',
 'American_Widgeon',
 'American_Wigeon',
 'Bald_Eagle',
 'Bantam',
 "Bewick'S_Swan",
 'Black_Duck',
 'Black_Headed_Gull',
 'Black_Scoter',
 'Blue_Winged_Teal',
 'Blue_Winged_Teal_',
 'Bufflehead',
 'Canada_Goose',
 'Canvasback',
 'Chicken',
 'Cinnamon_Teal',
 'Common_Coot',
 'Common_Eider',
 'Common_Goldeneye',
 'Common_Murre',
 'Dog',
 'Domestic_Cat',
 'Double_Crested_Cormorant',
 'Duck',
 'Dunlin',
 'Eurasian_Teal',
 'Eurasian_Wigeon',
 'Ferret',
 'Gadwall',
 'Garganey',
 'Glaucous_Gull',
 'Goose',
 'Greater_White_Fronted_Goose',
 'Grebe',
 'Green_Winged_Teal',
 'Guineafowl',
 'Gull',
 'Herring_Gull',
 'Hooded_Merganser',
 'Human',
 'Knot',
 'LAB',
 'Large_Cat',
 'Laughing_Gull',
 'Lesser_Scaup',
 'Long_Tailed_Duck',
 'Mallard',
 'Mallard_Black_Duck_Hybrid',
 'Mallard_Duck',
 'Mew_Gull',
 'Mink',
 'Murre',
 'Muscovy_Duck',
 'Northern_Pintail',
 'Northern_Shoveler',
 'Ostrich',
 'Pacific_Golden_Plover',
 'Panda',
 'Pelican',
 'Pig',
 'Pigeon',
 'Pink_Footed_Goose',
 'Pintail',
 'Poultry',
 'Quail',
 'Red_Crested_Pochard',
 'Red_Knot',
 'Red_Necked_Stint',
 'Redhead',
 'Redhead_Duck',
 'Ring_Billed_Gull',
 'Ring_Necked_Duck',
 'Rosy_Billed_Pochard',
 'Ruddy_Turnstone',
 'Sanderling',
 'Semipalmated_Sandpiper',
 'Shelduck',
 'Shorebird',
 'Shoveler',
 'Sloth_Bear',
 'Snow_Goose',
 'Sparrow',
 'Swan',
 'Swine',
 'Teal',
 'Tufted_Duck',
 'Turkey',
 'Velvet_Scoter',
 'Waterfowl',
 'Weasel',
 'White_Fronted_Goose',
 'White_Rumped_Munia',
 'Widgeon',
 'Wood_Duck'}



In [13]:

    
# Grab out the labels.
output_lb = LabelBinarizer()
output_lb.fit(train_test_df['host'])
Y = output_lb.fit_transform(train_test_df['host'])
Y = Y.astype(np.float32)  # Necessary for passing the data into nolearn.
Y.shape









    Out[13]:





(25414, 95)



In [14]:

    
X = train_test_df.ix[:,:-1].values
X = X.astype(np.float32)  # Necessary for passing the data into nolearn.
X.shape









    Out[14]:





(25414, 566)

Next up, we do the train/test split.



In [15]:

    
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

For comparison, let's train a random forest classifier, and see what the concordance is between the predicted labels and the actual labels.



In [84]:

    
rf = RandomForestClassifier()
rf.fit(X_train, Y_train)
predictions = rf.predict(X_test)
predicted_labels = output_lb.inverse_transform(predictions)
# Compute the mutual information between the predicted labels and the actual labels.
mi(predicted_labels, output_lb.inverse_transform(Y_test))









    Out[84]:





0.016610553650984284

By the majority-consensus rule, and using mutual information as the metric for scoring, things look not so bad! As mentioned above, the RandomForestClassifier is a pretty powerful method for finding non-linear patterns between features and class labels.

Uncomment the cell below if you want to try the scikit-learn's ExtraTreesClassifier.



In [85]:

    
# et = ExtraTreesClassifier()
# et.fit(X_train, Y_train)
# predictions = et.predict(X_test)
# predicted_labels = output_lb.inverse_transform(predictions)
# mi(predicted_labels, output_lb.inverse_transform(Y_test))

As a demonstration of how this model can be used, let's look at the ambiguously labeled sequences, i.e. those from "Environment" and "Avian", to see whether we can make a prediction as to what host it likely came frome.



In [86]:

    
# unknown_hosts = unknowns.ix[:,:-1].values

# preds = rf.predict(unknown_hosts)
# output_lb.inverse_transform(preds)

Alrighty - we're now ready to try out a neural network! For this try, we will use lasagne and nolearn, two packages which have made things pretty easy for building neural networks. In this segment, I'm going to not show experiments with multiple architectures, activations and the like. The goal is to illustrate how easy the specification of a neural network is.

The network architecture that we'll try is as such:

1 input layer, of shape 6750 (i.e. taking in the columns as data).
1 hidden layer, with 300 units.
1 output layer, of shape 140 (i.e. each of the class labels).



In [109]:

    
from lasagne import nonlinearities as nl
net1 = NeuralNet(layers=[
        ('input', layers.InputLayer),
        ('hidden1', layers.DenseLayer),
        #('dropout', layers.DropoutLayer),
        #('hidden2', layers.DenseLayer),
        #('dropout2', layers.DropoutLayer),
        ('output', layers.DenseLayer),
    ],
        # Layer parameters:
        input_shape=(None, X.shape[1]),
        hidden1_num_units=300,
        #dropout_p=0.3,
        #hidden2_num_units=500,
        #dropout2_p=0.3,
        output_nonlinearity=nl.softmax,
        output_num_units=Y.shape[1],
        #allow_input_downcast=True,

        # Optimization Method:
        update=nesterov_momentum,
        update_learning_rate=0.01,
        update_momentum=0.9,
                
        regression=True,
        max_epochs=100,
        verbose=1
        )

Training a simple neural network on my MacBook Air takes quite a bit of time :). But the function call for fitting it is a simple nnet.fit(X, Y).



In [110]:

    
net1.fit(X_train, Y_train)









    



# Neural Network with 197792 learnable parameters

## Layer information

  #  name       size
---  -------  ------
  0  input       566
  1  hidden1     300
  2  output       92

  epoch    train loss    valid loss    train/val  dur
-------  ------------  ------------  -----------  -----
      1       0.00935       0.00808      1.15611  0.43s
      2       0.00801       0.00790      1.01385  0.43s
      3       0.00794       0.00787      1.00938  0.43s
      4       0.00791       0.00785      1.00878  0.43s
      5       0.00789       0.00782      1.00854  0.43s
      6       0.00786       0.00779      1.00864  0.43s
      7       0.00781       0.00774      1.00965  0.43s
      8       0.00774       0.00763      1.01448  0.43s
      9       0.00749       0.00708      1.05802  0.43s
     10       0.00616       0.00540      1.14105  0.43s
     11       0.00551       0.00533      1.03429  0.43s
     12       0.00549       0.00531      1.03277  0.43s
     13       0.00548       0.00531      1.03249  0.43s
     14       0.00547       0.00530      1.03227  0.43s
     15       0.00547       0.00530      1.03210  0.43s
     16       0.00546       0.00529      1.03197  0.43s
     17       0.00546       0.00529      1.03186  0.43s
     18       0.00546       0.00529      1.03178  0.43s
     19       0.00545       0.00529      1.03171  0.43s
     20       0.00545       0.00528      1.03164  0.42s
     21       0.00545       0.00528      1.03158  0.41s
     22       0.00545       0.00528      1.03152  0.41s
     23       0.00544       0.00528      1.03146  0.41s
     24       0.00544       0.00528      1.03141  0.41s
     25       0.00544       0.00527      1.03136  0.41s
     26       0.00544       0.00527      1.03131  0.41s
     27       0.00544       0.00527      1.03127  0.42s
     28       0.00543       0.00527      1.03123  0.43s
     29       0.00543       0.00527      1.03120  0.43s
     30       0.00543       0.00527      1.03116  0.43s
     31       0.00543       0.00526      1.03113  0.43s
     32       0.00543       0.00526      1.03111  0.43s
     33       0.00542       0.00526      1.03108  0.43s
     34       0.00542       0.00526      1.03106  0.43s
     35       0.00542       0.00525      1.03105  0.43s
     36       0.00541       0.00525      1.03105  0.43s
     37       0.00541       0.00525      1.03105  0.43s
     38       0.00541       0.00524      1.03107  0.43s
     39       0.00540       0.00524      1.03110  0.43s
     40       0.00540       0.00523      1.03114  0.42s
     41       0.00539       0.00523      1.03120  0.42s
     42       0.00538       0.00522      1.03127  0.42s
     43       0.00537       0.00521      1.03136  0.41s
     44       0.00536       0.00520      1.03145  0.41s
     45       0.00535       0.00519      1.03151  0.41s
     46       0.00534       0.00518      1.03150  0.41s
     47       0.00533       0.00517      1.03142  0.41s
     48       0.00532       0.00516      1.03128  0.41s
     49       0.00531       0.00515      1.03112  0.41s
     50       0.00530       0.00514      1.03100  0.41s
     51       0.00529       0.00514      1.03092  0.41s
     52       0.00529       0.00513      1.03086  0.41s
     53       0.00528       0.00513      1.03083  0.41s
     54       0.00528       0.00512      1.03082  0.41s
     55       0.00528       0.00512      1.03081  0.41s
     56       0.00527       0.00511      1.03081  0.41s
     57       0.00527       0.00511      1.03082  0.41s
     58       0.00526       0.00510      1.03082  0.41s
     59       0.00525       0.00510      1.03083  0.41s
     60       0.00525       0.00509      1.03082  0.41s
     61       0.00524       0.00508      1.03079  0.41s
     62       0.00523       0.00507      1.03073  0.41s
     63       0.00522       0.00506      1.03060  0.41s
     64       0.00521       0.00505      1.03042  0.41s
     65       0.00520       0.00505      1.03024  0.41s
     66       0.00519       0.00504      1.03008  0.41s
     67       0.00518       0.00503      1.02997  0.41s
     68       0.00518       0.00503      1.02990  0.41s
     69       0.00518       0.00503      1.02986  0.41s
     70       0.00517       0.00502      1.02983  0.41s
     71       0.00517       0.00502      1.02980  0.41s
     72       0.00517       0.00502      1.02977  0.41s
     73       0.00517       0.00502      1.02973  0.41s
     74       0.00517       0.00502      1.02969  0.41s
     75       0.00517       0.00502      1.02964  0.41s
     76       0.00517       0.00502      1.02959  0.41s
     77       0.00516       0.00502      1.02953  0.41s
     78       0.00516       0.00502      1.02947  0.41s
     79       0.00516       0.00501      1.02941  0.41s
     80       0.00516       0.00501      1.02935  0.42s
     81       0.00516       0.00501      1.02929  0.42s
     82       0.00516       0.00501      1.02923  0.42s
     83       0.00516       0.00501      1.02917  0.43s
     84       0.00516       0.00501      1.02911  0.42s
     85       0.00516       0.00501      1.02905  0.41s
     86       0.00516       0.00501      1.02899  0.41s
     87       0.00516       0.00501      1.02894  0.41s
     88       0.00516       0.00501      1.02888  0.41s
     89       0.00515       0.00501      1.02883  0.41s
     90       0.00515       0.00501      1.02877  0.41s
     91       0.00515       0.00501      1.02872  0.41s
     92       0.00515       0.00501      1.02865  0.41s
     93       0.00515       0.00501      1.02860  0.41s
     94       0.00515       0.00501      1.02854  0.41s
     95       0.00515       0.00501      1.02848  0.41s
     96       0.00515       0.00501      1.02843  0.41s
     97       0.00515       0.00501      1.02839  0.41s
     98       0.00515       0.00501      1.02834  0.41s
     99       0.00515       0.00501      1.02829  0.41s
    100       0.00515       0.00501      1.02824  0.41s






    Out[110]:





NeuralNet(X_tensor_type=None,
     batch_iterator_test=<nolearn.lasagne.base.BatchIterator object at 0x7fa7e19c5d30>,
     batch_iterator_train=<nolearn.lasagne.base.BatchIterator object at 0x7fa7e19c5cc0>,
     custom_score=None, hidden1_num_units=300, input_shape=(None, 566),
     layers=[('input', <class 'lasagne.layers.input.InputLayer'>), ('hidden1', <class 'lasagne.layers.dense.DenseLayer'>), ('output', <class 'lasagne.layers.dense.DenseLayer'>)],
     loss=None, max_epochs=100, more_params={},
     objective=<function objective at 0x7fa7e19b28c8>,
     objective_loss_function=<function squared_error at 0x7fa7e1b44bf8>,
     on_epoch_finished=[<nolearn.lasagne.handlers.PrintLog object at 0x7fa7d65ec860>],
     on_training_finished=[],
     on_training_started=[<nolearn.lasagne.handlers.PrintLayerInfo object at 0x7fa7d65ecf28>],
     output_nonlinearity=<function softmax at 0x7fa7e206c6a8>,
     output_num_units=92, regression=True,
     train_split=<nolearn.lasagne.base.TrainSplit object at 0x7fa7e19c5d68>,
     update=<function nesterov_momentum at 0x7fa7e1b52488>,
     update_learning_rate=0.01, update_momentum=0.9,
     use_label_encoder=False, verbose=1,
     y_tensor_type=TensorType(float32, matrix))

Let's grab out the predictions!



In [111]:

    
preds = net1.predict(X_test)
preds.shape









    Out[111]:





(6350, 92)

We're going to see how good the classifier did by examining the class labels. The way to visualize this is to have, say, the class labels on the X-axis, and the probability of prediction on the Y-axis. We can do this sample by sample. Here's a simple example with no frills in the matplotlib interface.



In [112]:

    
import matplotlib.pyplot as plt

%matplotlib inline

plt.bar(np.arange(len(preds[0])), preds[0])









    Out[112]:





<Container object of 92 artists>

Alrighty, let's add some frills - the class labels, the probability of each class label, and the original class label.



In [113]:

    
### NOTE: Change the value of i to anything above!
i = 111
plt.figure(figsize=(20,5))
plt.bar(np.arange(len(output_lb.classes_)), preds[i])
plt.xticks(np.arange(len(output_lb.classes_)) + 0.5, output_lb.classes_, rotation='vertical')
plt.title('Original Label: ' + output_lb.inverse_transform(Y_test)[i])
plt.show()
# print(output_lb.inverse_transform(Y_test)[i])

Let's do a majority-consensus rule applied to the labels, and then compute the mutual information score again.



In [114]:

    
preds_labels = []
for i in range(preds.shape[0]):
    maxval = max(preds[i])
    pos = list(preds[i]).index(maxval)
    
    preds_labels.append(output_lb.classes_[pos])

mi(preds_labels, output_lb.inverse_transform(Y_test))









    Out[114]:





0.001634874148176865

With a score of 0.73, that's not bad either! It certainly didn't outperform the RandomForestClassifier, but the default parameters on the RFC were probably pretty good to begin with. Notice how little tweaking on the neural network we had to do as well.

For good measure, these were the class labels. Notice how successful influenza has been in replicating across the many different species!



In [115]:

    
output_lb.classes_









    Out[115]:





array(['American_Black_Duck', 'American_Green_Winged_Teal',
       'American_Widgeon', 'American_Wigeon', 'Bantam', 'Barnacle_Goose',
       'Black_Duck', 'Black_Headed_Gull', 'Black_Scoter',
       'Blue_Winged_Teal', 'Blue_Winged_Teal_', 'Bufflehead',
       'Canada_Goose', 'Canvasback', 'Chicken', 'Cinnamon_Teal',
       'Common_Coot', 'Common_Eider', 'Common_Goldeneye', 'Common_Murre',
       'Dog', 'Domestic_Cat', 'Double_Crested_Cormorant', 'Duck', 'Dunlin',
       'Eurasian_Wigeon', 'Ferret', 'Gadwall', 'Garganey', 'Glaucous_Gull',
       'Goose', 'Greater_White_Fronted_Goose', 'Grebe',
       'Green_Winged_Teal', 'Guineafowl', 'Gull', 'Herring_Gull',
       'Hooded_Merganser', 'Human', 'Knot', 'LAB', 'Large_Cat',
       'Laughing_Gull', 'Lesser_Scaup', 'Long_Tailed_Duck', 'Mallard',
       'Mallard_Black_Duck_Hybrid', 'Mallard_Duck', 'Mew_Gull', 'Mink',
       'Murre', 'Muscovy_Duck', 'Northern_Pintail', 'Northern_Shoveler',
       'Ostrich', 'Pacific_Golden_Plover', 'Panda', 'Pelican', 'Pig',
       'Pigeon', 'Pintail', 'Poultry', 'Quail', 'Red_Crested_Pochard',
       'Red_Knot', 'Red_Necked_Stint', 'Redhead', 'Redhead_Duck',
       'Ring_Billed_Gull', 'Ring_Necked_Duck', 'Rosy_Billed_Pochard',
       'Ruddy_Turnstone', 'Sanderling', 'Semipalmated_Sandpiper',
       'Shelduck', 'Shorebird', 'Shoveler', 'Sloth_Bear', 'Snow_Goose',
       'Sparrow', 'Swan', 'Swine', 'Teal', 'Tufted_Duck', 'Turkey',
       'Velvet_Scoter', 'Waterfowl', 'Weasel', 'White_Fronted_Goose',
       'White_Rumped_Munia', 'Widgeon', 'Wood_Duck'], 
      dtype='<U27')

The biology behind this dataset.

A bit more about the biology of influenza.

If you made it this far, thank you for hanging on! How does this mini project relate to the biology of flu?

As the flu evolves and moves between viral hosts, it gradually adapts to that host. This allows it to successfully establish an infection in the host population.

We can observe the viral host as we sample viruses from it. Sometimes, we don't catch it in its adapted state, but it's un-adapted state, as if it had freshly joined in from its other population. That is likely why some of the class labels are mis-identified.

Also, there are environmentally sampled isolates. They obviously aren't simply replicating in the environment (i.e. bodies of water), but in some host, and were shed into the water. For these guys, the host labels won't necessarily match up, as there'll be a stronger signal with particular hosts - whether it be from ducks, pigs or even humans.

Next steps?

There's a few obvious things that can be done.

Latin hypercube sampling for Random Forest parameters.
Experimenting with adding more layers, tweaking the layer types etc.

What else might be done? Ping me at ericmajinglong@gmail.com with the subject "neural nets and HA". :)



In [ ]: